multinomial (Multinomial distribution)#

The multinomial distribution models counts across K categories when you run n independent categorical trials with fixed category probabilities.

This notebook uses the same parameterization as scipy.stats.multinomial:

  • n = total number of trials (a non-negative integer)

  • p = probability vector of length K with entries in \([0,1]\) summing to 1

Learning goals#

By the end you should be able to:

  • recognize when a multinomial model is appropriate (and when it isn’t)

  • write down the PMF and a useful notion of a multivariate CDF

  • compute and interpret the mean vector, covariance, and key properties

  • derive the likelihood and the MLE for p

  • sample from a multinomial using NumPy-only algorithms

  • visualize the PMF/CDF for K=3 on the simplex (triangle)

  • use scipy.stats.multinomial for PMF/moments/sampling and implement missing pieces (CDF/fit) yourself

Prerequisites#

  • Basic probability (PMF/CDF), expectation, variance, covariance

  • Multinomial coefficients / factorials

  • Comfort with logs and derivatives

Table of contents#

  1. Title & Classification

  2. Intuition & Motivation

  3. Formal Definition

  4. Moments & Properties

  5. Parameter Interpretation

  6. Derivations

  7. Sampling & Simulation

  8. Visualization

  9. SciPy Integration

  10. Statistical Use Cases

  11. Pitfalls

  12. Summary

import math
import os

import numpy as np
import plotly.graph_objects as go
import plotly.io as pio

pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")

rng = np.random.default_rng(7)
np.set_printoptions(precision=6, suppress=True)

1) Title & Classification#

Name: multinomial (Multinomial distribution)
Type: Discrete (multivariate counts)

Support (for K categories and total count n):

\[ \mathcal{S}_{n,K} = \left\{ x \in \{0,1,2,\dots\}^K : \sum_{i=1}^K x_i = n \right\} \]

Parameter space:

  • \(n \in \{0,1,2,\dots\}\)

  • \(p=(p_1,\dots,p_K)\) with \(p_i\ge 0\) and \(\sum_{i=1}^K p_i = 1\)

  • typically \(K\ge 2\)

We write:

\[ X \sim \mathrm{Multinomial}(n, p),\qquad X=(X_1,\dots,X_K) \]

where \(X_i\) counts how many times category \(i\) occurred.

2) Intuition & Motivation#

What this distribution models#

A clean mental model is rolling a K-sided die n times. If the probability of face \(i\) is \(p_i\), and \(X_i\) counts how many times you saw face \(i\), then

\[ X = (X_1,\dots,X_K) \sim \mathrm{Multinomial}(n,p). \]

Key idea:

The multinomial is a multivariate generalization of the binomial: it counts how many times each category occurs.

Typical real-world use cases#

  • Survey / A-B-n experiments: how many users chose each option.

  • NLP (bag-of-words): word counts in a document given a fixed word-probability vector.

  • Quality control: counts of defect types.

  • Genetics: allele counts across categories.

  • Marketing / ads: impressions or clicks split across multiple creatives.

Relations to other distributions#

  • Categorical (a.k.a. multinoulli): if \(n=1\), then \(X\) is one-hot and you recover a categorical draw.

  • Binomial: if \(K=2\), then \(X_1\sim\mathrm{Bin}(n,p_1)\) and \(X_2=n-X_1\).

  • Binomial marginals: for any \(i\), \(X_i\sim\mathrm{Bin}(n,p_i)\), though the components are dependent (see the quick check after this list).

  • Poisson splitting: if \(Y_i\stackrel{\text{ind}}{\sim}\mathrm{Poisson}(\lambda p_i)\), then \((Y_1,\dots,Y_K)\mid \sum_i Y_i=n \;\sim\; \mathrm{Multinomial}(n,p)\).

  • Dirichlet–multinomial: if \(p\) is random with a Dirichlet prior and then \(X\mid p\) is multinomial, you get overdispersion.
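
The binomial-marginal relation is easy to sanity-check by simulation. A minimal sketch (the generator `rng_check` and the constants are illustrative; a throwaway generator leaves the shared `rng` stream, and the outputs shown later, untouched):

# Check the binomial-marginal relation X_i ~ Bin(n, p_i) by Monte Carlo.
rng_check = np.random.default_rng(0)
n_chk, p_chk = 50, np.array([0.2, 0.3, 0.5])

draws = rng_check.multinomial(n_chk, p_chk, size=200_000)

# Each column should match Bin(n, p_i): mean n*p_i, variance n*p_i*(1-p_i).
for i in range(p_chk.size):
    print(
        f"X_{i + 1}: mean {draws[:, i].mean():.3f} vs {n_chk * p_chk[i]:.3f}, "
        f"var {draws[:, i].var():.3f} vs {n_chk * p_chk[i] * (1 - p_chk[i]):.3f}"
    )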

3) Formal Definition#

Let \(X=(X_1,\dots,X_K)\) be a count vector with \(\sum_i X_i = n\).

PMF#

For \(x\in\mathcal{S}_{n,K}\):

\[ \Pr(X=x\mid n,p) = \frac{n!}{\prod_{i=1}^K x_i!}\,\prod_{i=1}^K p_i^{x_i}. \]

If \(x\notin\mathcal{S}_{n,K}\), the probability is 0.

A common shorthand is the multinomial coefficient:

\[ \binom{n}{x_1,\dots,x_K} = \frac{n!}{\prod_{i=1}^K x_i!}. \]
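
For example, with \(n=6\), \(p=(0.2,0.3,0.5)\), and \(x=(1,2,3)\):

\[ \Pr\bigl(X=(1,2,3)\bigr) = \frac{6!}{1!\,2!\,3!}\,(0.2)^1(0.3)^2(0.5)^3 = 60 \times 0.00225 = 0.135, \]

which matches the max_pmf value in the sanity-check cell later in this section.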

CDF (lower-orthant CDF)#

A standard multivariate analogue of a CDF is the lower-orthant probability:

\[ F(x) = \Pr(X_1\le x_1,\dots,X_K\le x_K) = \sum_{y\in\mathcal{S}_{n,K}:\; y\le x} \Pr(X=y), \]

where \(y\le x\) means componentwise inequality.

Notes:

  • There is no simple closed form for this CDF in general.

  • Because \(\sum_i X_i=n\) almost surely, you need \(\sum_i x_i\ge n\) to get a nonzero CDF.

  • For visualization, it’s often more informative to leave some components unconstrained, e.g.

\[ \Pr(X_1\le a,\;X_2\le b) = F(a,b,n,\dots,n). \]
def _validate_n(n):
    if isinstance(n, bool) or not isinstance(n, (int, np.integer)):
        raise TypeError("n must be an integer")
    n_int = int(n)
    if n_int < 0:
        raise ValueError("n must be >= 0")
    return n_int


def _validate_p(p):
    p = np.asarray(p, dtype=float)
    if p.ndim != 1:
        raise ValueError("p must be a 1D probability vector")
    if p.size < 2:
        raise ValueError("p must have length K>=2")
    if not np.all(np.isfinite(p)):
        raise ValueError("p must be finite")
    if np.any(p < 0):
        raise ValueError("p must be non-negative")

    s = float(p.sum())
    if not np.isclose(s, 1.0, atol=1e-12, rtol=0.0):
        raise ValueError(f"p must sum to 1 (got {s})")

    # Renormalize away any remaining floating-point error so downstream
    # code can rely on an exact sum of 1.
    if s != 1.0:
        p = p / s
    return p


def _validate_counts(x, k, *, require_sum_n=None):
    x = np.asarray(x)
    if x.ndim == 1:
        x = x[None, :]
    if x.ndim != 2 or x.shape[1] != k:
        raise ValueError(f"x must have shape (k,) or (m,k) with k={k}")

    if not np.issubdtype(x.dtype, np.integer):
        if np.any(np.abs(x - np.round(x)) > 0):
            raise ValueError("x must contain integers")
        x = np.round(x).astype(int)
    else:
        x = x.astype(int)

    if np.any(x < 0):
        raise ValueError("x must be nonnegative")

    if require_sum_n is not None:
        row_sums = x.sum(axis=1)
        if np.any(row_sums != require_sum_n):
            raise ValueError("Each row of x must sum to n")

    return x


# Vectorized log-gamma so math.lgamma applies elementwise to count arrays.
_lgamma_vec = np.vectorize(math.lgamma)


def multinomial_logpmf(x, n, p):
    n = _validate_n(n)
    p = _validate_p(p)
    x = _validate_counts(x, k=p.size, require_sum_n=n)

    if n == 0:
        out = np.zeros(x.shape[0], dtype=float)
        return out[0] if out.size == 1 else out

    log_coeff = math.lgamma(n + 1.0) - np.sum(_lgamma_vec(x + 1.0), axis=1)

    log_p = np.where(p > 0, np.log(p), -np.inf)
    # Interpret 0 * log(0) = 0 (limit).
    x_log_p = np.where(x == 0, 0.0, x * log_p[None, :])
    log_prob = np.sum(x_log_p, axis=1)

    out = log_coeff + log_prob
    return out[0] if out.size == 1 else out


def multinomial_pmf(x, n, p):
    return np.exp(multinomial_logpmf(x, n=n, p=p))


def compositions(n, k):
    # Generate all k-tuples of nonnegative integers summing to n (stars and bars).
    if k == 1:
        yield (n,)
        return
    for i in range(n + 1):
        for tail in compositions(n - i, k - 1):
            yield (i,) + tail


def enumerate_support(n, k):
    n = _validate_n(n)
    if k < 1:
        raise ValueError("k must be >= 1")
    return np.array(list(compositions(n, k)), dtype=int)


def multinomial_cdf_small_n(x, n, p):
    # Lower-orthant CDF by brute-force summation (only feasible for small n,k).
    n = _validate_n(n)
    p = _validate_p(p)
    x = _validate_counts(x, k=p.size)[0]

    ys = enumerate_support(n=n, k=p.size)
    mask = np.all(ys <= x[None, :], axis=1)
    return float(np.sum(multinomial_pmf(ys[mask], n=n, p=p)))


def simplex_xy_3(counts):
    # Map 3-category compositions to 2D barycentric coordinates for plotting.
    counts = np.asarray(counts, dtype=float)
    counts = np.atleast_2d(counts)
    if counts.shape[1] != 3:
        raise ValueError("simplex_xy_3 expects shape (m,3)")

    n = counts.sum(axis=1)
    if np.any(n <= 0):
        raise ValueError("All rows must sum to a positive n")

    p = counts / n[:, None]
    x = p[:, 1] + 0.5 * p[:, 2]
    y = (np.sqrt(3) / 2.0) * p[:, 2]
    return x, y
# Quick sanity check: PMF sums to 1 on the support
n = 6
p = np.array([0.2, 0.3, 0.5])

support = enumerate_support(n=n, k=p.size)
pmf = multinomial_pmf(support, n=n, p=p)

{
    "support_size": int(support.shape[0]),
    "pmf_sum": float(pmf.sum()),
    "min_pmf": float(pmf.min()),
    "max_pmf": float(pmf.max()),
}
{'support_size': 28,
 'pmf_sum': 1.000000000000001,
 'min_pmf': 6.400000000000006e-05,
 'max_pmf': 0.13500000000000018}

4) Moments & Properties#

Let \(X\sim\mathrm{Multinomial}(n,p)\) with \(p\in\mathbb{R}^K\) and \(\sum_i p_i=1\).

Mean and covariance#

For each component:

\[ \mathbb{E}[X_i] = n p_i. \]

The variance and covariance are:

\[ \mathrm{Var}(X_i) = n p_i(1-p_i),\qquad \mathrm{Cov}(X_i,X_j) = -n p_i p_j\quad (i\ne j). \]

Equivalently, the covariance matrix is

\[ \Sigma = n\,(\mathrm{diag}(p) - p p^\top). \]

A key qualitative property is negative dependence: if one category count is high, others must (on average) be lower because the total is fixed.

Skewness and kurtosis (marginals)#

Each component \(X_i\) is marginally binomial: \(X_i\sim\mathrm{Bin}(n,p_i)\). Therefore, for each \(i\):

\[ \text{skew}(X_i) = \frac{1-2p_i}{\sqrt{n p_i(1-p_i)}}, \qquad \text{excess kurt}(X_i) = \frac{1-6p_i(1-p_i)}{n p_i(1-p_i)}. \]

(These are univariate skewness/kurtosis of the marginals; multivariate notions also exist.)
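
These marginal formulas are easy to spot-check by Monte Carlo; a sketch for the skewness of \(X_1\) (the names `rng_sk`, `n_sk`, `p_sk` and the constants are illustrative):

# Spot-check the marginal skewness formula for X_1 by simulation.
rng_sk = np.random.default_rng(1)
n_sk, p_sk = 30, np.array([0.1, 0.3, 0.6])

draws = rng_sk.multinomial(n_sk, p_sk, size=500_000)
z = (draws[:, 0] - draws[:, 0].mean()) / draws[:, 0].std()

skew_mc = float((z**3).mean())
skew_theory = (1 - 2 * p_sk[0]) / np.sqrt(n_sk * p_sk[0] * (1 - p_sk[0]))

# The two values should agree to roughly two decimal places.
print(skew_mc, float(skew_theory))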

MGF and characteristic function#

For a vector \(t\in\mathbb{R}^K\), the multivariate moment generating function is

\[ M_X(t) = \mathbb{E}[e^{t^\top X}] = \left(\sum_{i=1}^K p_i e^{t_i}\right)^n. \]

The characteristic function (\(\omega\in\mathbb{R}^K\)) is

\[ \varphi_X(\omega) = \mathbb{E}[e^{i\,\omega^\top X}] = \left(\sum_{i=1}^K p_i e^{i\,\omega_i}\right)^n. \]
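
Both identities can be verified numerically at a test point; a sketch for the MGF (keep \(t\) small so \(e^{t^\top X}\) stays well-scaled; the names and constants are illustrative):

# Verify E[exp(t^T X)] = (sum_i p_i e^{t_i})^n at one test point t.
rng_mgf = np.random.default_rng(2)
n_m, p_m = 10, np.array([0.2, 0.3, 0.5])
t = np.array([0.05, -0.02, 0.01])

draws = rng_mgf.multinomial(n_m, p_m, size=400_000)

mgf_mc = float(np.exp(draws @ t).mean())
mgf_theory = float(np.sum(p_m * np.exp(t)) ** n_m)

# Both should be about 1.097 and agree to roughly three decimal places.
print(mgf_mc, mgf_theory)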

Entropy#

The entropy is

\[ H(X) = -\sum_{x\in\mathcal{S}_{n,K}} \Pr(X=x)\,\log\Pr(X=x), \]

which generally has no simple closed form.

For large \(n\) (and all \(p_i>0\)), a useful approximation comes from a multivariate normal approximation on the \((K-1)\)-dimensional simplex:

\[ H(X)\;\approx\;\tfrac12\,\log\Bigl((2\pi e)^{K-1}\,n^{K-1}\,\prod_{i=1}^K p_i\Bigr). \]
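
For small cases the exact entropy can be computed by enumeration and compared with this approximation; a sketch reusing enumerate_support and multinomial_logpmf from section 3 (constants are illustrative):

# Exact entropy by enumeration vs. the normal approximation (natural log).
n_H, p_H = 40, np.array([0.2, 0.3, 0.5])
K_H = p_H.size

support_H = enumerate_support(n=n_H, k=K_H)
logpmf_H = multinomial_logpmf(support_H, n=n_H, p=p_H)
entropy_exact = float(-np.sum(np.exp(logpmf_H) * logpmf_H))

entropy_approx = 0.5 * np.log((2 * np.pi * np.e * n_H) ** (K_H - 1) * np.prod(p_H))

# The approximation should land within a few percent of the exact value here.
print(entropy_exact, float(entropy_approx))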

Other useful properties#

  • Sum constraint: \(\sum_i X_i = n\) almost surely, so the covariance matrix has rank \(K-1\).

  • Merging categories: if you merge two categories \(i\) and \(j\), you get another multinomial whose merged category has probability \(p_i+p_j\) (checked by simulation below).

  • Conditionals: given some counts, the remaining counts are multinomial (or binomial in the sequential decomposition).
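
The merging property can be checked by simulation: for \(K=3\), the merged count \(X_1+X_2\) should be \(\mathrm{Bin}(n,\,p_1+p_2)\). A minimal sketch (illustrative names and constants):

# Check that merging categories 1 and 2 gives a Bin(n, p1 + p2) count.
rng_merge = np.random.default_rng(3)
n_mg, p_mg = 25, np.array([0.1, 0.3, 0.6])

draws = rng_merge.multinomial(n_mg, p_mg, size=300_000)
merged = draws[:, 0] + draws[:, 1]

p12 = p_mg[0] + p_mg[1]
# Mean should be ~10.0 and variance ~6.0 for these parameters.
print(float(merged.mean()), n_mg * p12)
print(float(merged.var()), n_mg * p12 * (1 - p12))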

# Verify mean/covariance with Monte Carlo (using NumPy's built-in multinomial sampler)

def mean_cov_multinomial(n, p):
    n = _validate_n(n)
    p = _validate_p(p)
    mean = n * p
    cov = n * (np.diag(p) - np.outer(p, p))
    return mean, cov


n = 40
p = np.array([0.15, 0.35, 0.20, 0.30])

mean_theory, cov_theory = mean_cov_multinomial(n, p)

samples = rng.multinomial(n, p, size=200_000)
mean_mc = samples.mean(axis=0)
cov_mc = np.cov(samples.T, ddof=0)

{
    "mean_theory": mean_theory,
    "mean_mc": mean_mc,
    "max_abs_mean_err": float(np.max(np.abs(mean_mc - mean_theory))),
    "max_abs_cov_err": float(np.max(np.abs(cov_mc - cov_theory))),
}
{'mean_theory': array([ 6., 14.,  8., 12.]),
 'mean_mc': array([ 6.004515, 13.99794 ,  7.99156 , 12.005985]),
 'max_abs_mean_err': 0.008440000000000225,
 'max_abs_cov_err': 0.03195538522499941}

5) Parameter Interpretation#

n (total count)#

  • Controls the scale of the counts.

  • As n increases with fixed p, the distribution concentrates around its mean \(np\) and (after centering/scaling) becomes close to multivariate normal.

p (category probabilities)#

  • Controls the direction of the mean and where the mass sits on the simplex.

  • If one \(p_i\) is large, most mass is near the corresponding vertex (most trials land in that category).

  • If p is close to uniform, mass is concentrated near the center (counts are balanced).

For K=3: points \(x=(x_1,x_2,x_3)\) lie on a triangle (the 2-simplex) because \(x_1+x_2+x_3=n\).

# How p changes the shape (K=3): PMF on the simplex

def plot_simplex_pmf(n, p, *, title):
    p = _validate_p(p)
    if p.size != 3:
        raise ValueError("This visualization expects K=3")

    support = enumerate_support(n=n, k=3)
    pmf = multinomial_pmf(support, n=n, p=p)
    sx, sy = simplex_xy_3(support)

    tri_x = [0.0, 1.0, 0.5, 0.0]
    tri_y = [0.0, 0.0, float(np.sqrt(3) / 2.0), 0.0]

    fig = go.Figure()
    fig.add_trace(go.Scatter(x=tri_x, y=tri_y, mode="lines", line=dict(color="black"), showlegend=False))
    fig.add_trace(
        go.Scatter(
            x=sx,
            y=sy,
            mode="markers",
            marker=dict(
                size=10,
                color=pmf,
                colorscale="Viridis",
                colorbar=dict(title="PMF"),
                line=dict(width=0.2, color="rgba(0,0,0,0.2)"),
            ),
            text=[f"x={tuple(row)}, pmf={val:.3e}" for row, val in zip(support, pmf)],
            hoverinfo="text",
            showlegend=False,
        )
    )

    fig.update_layout(
        title=title,
        xaxis_title="barycentric x",
        yaxis_title="barycentric y",
        xaxis=dict(range=[-0.05, 1.05], zeroline=False),
        yaxis=dict(
            scaleanchor="x",
            scaleratio=1,
            range=[-0.05, float(np.sqrt(3) / 2.0) + 0.05],
            zeroline=False,
        ),
    )

    fig.add_annotation(x=0.0, y=0.0, text="cat 1", showarrow=False, yshift=-12)
    fig.add_annotation(x=1.0, y=0.0, text="cat 2", showarrow=False, yshift=-12)
    fig.add_annotation(x=0.5, y=float(np.sqrt(3) / 2.0), text="cat 3", showarrow=False, yshift=12)

    fig.show()


n = 18
plot_simplex_pmf(n, p=[1 / 3, 1 / 3, 1 / 3], title="PMF on simplex (n=18, p uniform)")
plot_simplex_pmf(n, p=[0.70, 0.20, 0.10], title="PMF on simplex (n=18, p favors cat 1)")
plot_simplex_pmf(n, p=[0.10, 0.15, 0.75], title="PMF on simplex (n=18, p favors cat 3)")

6) Derivations#

A convenient derivation uses one-hot indicator variables.

Let \(Z_t\in\{e_1,\dots,e_K\}\) be the one-hot vector for trial \(t\), with

\[ \Pr(Z_t=e_i)=p_i, \]

and define counts as

\[ X = \sum_{t=1}^n Z_t,\qquad X_i = \sum_{t=1}^n Z_{t,i}. \]

Expectation#

By linearity of expectation:

\[ \mathbb{E}[X_i] = \sum_{t=1}^n \mathbb{E}[Z_{t,i}] = \sum_{t=1}^n p_i = n p_i. \]

Variance and covariance#

For a fixed trial \(t\), the indicators satisfy:

\[ \mathbb{E}[Z_{t,i}] = p_i, \qquad \mathrm{Var}(Z_{t,i}) = p_i(1-p_i), \qquad \mathrm{Cov}(Z_{t,i}, Z_{t,j}) = -p_i p_j\;(i\ne j) \]

because \(Z_{t,i}Z_{t,j}=0\) when \(i\ne j\), so \(\mathrm{Cov}(Z_{t,i},Z_{t,j}) = \mathbb{E}[Z_{t,i}Z_{t,j}] - p_i p_j = 0 - p_i p_j = -p_i p_j\).

Across trials \(t\ne s\), independence gives zero covariance. Therefore:

\[ \mathrm{Var}(X_i) = \sum_{t=1}^n \mathrm{Var}(Z_{t,i}) = n p_i(1-p_i), \]

and for \(i\ne j\):

\[ \mathrm{Cov}(X_i,X_j) = \sum_{t=1}^n \mathrm{Cov}(Z_{t,i}, Z_{t,j}) = -n p_i p_j. \]

Likelihood and MLE#

For one observed count vector \(x\in\mathcal{S}_{n,K}\), the likelihood (as a function of \(p\)) is

\[ L(p\mid x) = \frac{n!}{\prod_i x_i!}\,\prod_{i=1}^K p_i^{x_i}. \]

The log-likelihood (dropping constants independent of \(p\)) is

\[ \ell(p) = \sum_{i=1}^K x_i\log p_i \quad\text{subject to}\quad \sum_i p_i = 1,\;p_i\ge 0. \]

Using a Lagrange multiplier for the constraint gives the MLE:

\[ \hat p_i = \frac{x_i}{n}. \]
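
In detail: with the Lagrangian \(\mathcal{L}(p,\lambda)=\sum_i x_i\log p_i + \lambda\bigl(1-\sum_i p_i\bigr)\), stationarity gives

\[ \frac{\partial\mathcal{L}}{\partial p_i} = \frac{x_i}{p_i} - \lambda = 0 \;\Longrightarrow\; p_i = \frac{x_i}{\lambda}, \qquad \sum_i p_i = 1 \;\Longrightarrow\; \lambda = \sum_i x_i = n, \]

so \(\hat p_i = x_i/n\), the empirical frequency. (If some \(x_i=0\), the boundary value \(\hat p_i=0\) is still the maximizer.)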

For multiple independent observations \(x^{(1)},\dots,x^{(m)}\) (possibly with different totals \(n_j\)), the MLE becomes

\[ \hat p_i = \frac{\sum_{j=1}^m x_i^{(j)}}{\sum_{j=1}^m n_j}. \]
# MLE for p from multinomial samples

def fit_multinomial_mle(counts):
    counts = np.asarray(counts)
    if counts.ndim == 1:
        counts = counts[None, :]
    if counts.ndim != 2:
        raise ValueError("counts must have shape (k,) or (m,k)")
    if np.any(counts < 0):
        raise ValueError("counts must be nonnegative")

    totals = counts.sum(axis=1)
    if np.any(totals == 0):
        raise ValueError("each observation must have positive total count")

    total_counts = counts.sum(axis=0)
    return total_counts / total_counts.sum()


n = 25
p_true = np.array([0.10, 0.25, 0.05, 0.20, 0.40])

data = rng.multinomial(n, p_true, size=5_000)
p_hat = fit_multinomial_mle(data)

{
    "p_true": p_true,
    "p_hat": p_hat,
    "L1_error": float(np.sum(np.abs(p_hat - p_true))),
}
{'p_true': array([0.1 , 0.25, 0.05, 0.2 , 0.4 ]),
 'p_hat': array([0.100496, 0.250112, 0.049712, 0.199728, 0.399952]),
 'L1_error': 0.0012160000000000712}

7) Sampling & Simulation#

Below are two NumPy-only strategies.

A) Count categorical draws#

  1. Draw \(n\) iid categorical outcomes with probabilities \(p\).

  2. Count how many times each category appears.

This is the definition of the model, but it costs \(O(n)\) work per sample.

B) Sequential binomials (conditional decomposition)#

A very useful identity is that a multinomial can be generated as a sequence of conditional binomials:

\[ X_1\sim\mathrm{Bin}(n, p_1) \]
\[ X_2\mid X_1\sim\mathrm{Bin}\left(n-X_1, \frac{p_2}{1-p_1}\right) \]

…and so on, with the last component determined by the remaining count.

This is often faster than simulating all \(n\) categorical trials when n is large.

def multinomial_rvs_categorical_counts(n, p, size=1, *, rng: np.random.Generator):
    n = _validate_n(n)
    p = _validate_p(p)
    k = p.size

    if n == 0:
        out = np.zeros((size, k), dtype=int)
        return out[0] if size == 1 else out

    draws = rng.choice(k, size=(size, n), p=p)
    out = np.empty((size, k), dtype=int)
    for i in range(size):
        out[i] = np.bincount(draws[i], minlength=k)
    return out[0] if size == 1 else out


def multinomial_rvs_sequential_binomial(n, p, size=1, *, rng: np.random.Generator):
    n = _validate_n(n)
    p = _validate_p(p)
    k = p.size

    if n == 0:
        out = np.zeros((size, k), dtype=int)
        return out[0] if size == 1 else out

    out = np.zeros((size, k), dtype=int)
    remaining_n = np.full(size, n, dtype=int)

    remaining_prob = 1.0
    for i in range(k - 1):
        if p[i] == 0.0:
            xi = np.zeros(size, dtype=int)
        elif remaining_prob == p[i]:
            xi = remaining_n.copy()
        else:
            # Clip guards against pi_cond > 1 from accumulated float error.
            pi_cond = min(p[i] / remaining_prob, 1.0)
            xi = rng.binomial(remaining_n, pi_cond)

        out[:, i] = xi
        remaining_n = remaining_n - xi
        remaining_prob = remaining_prob - p[i]

    out[:, -1] = remaining_n
    return out[0] if size == 1 else out


# Compare the two NumPy-only samplers
n = 30
p = np.array([0.2, 0.1, 0.3, 0.4])
size = 100_000

s1 = multinomial_rvs_categorical_counts(n, p, size=size, rng=rng)
s2 = multinomial_rvs_sequential_binomial(n, p, size=size, rng=rng)

mean_theory, cov_theory = mean_cov_multinomial(n, p)

summary = {
    "max_abs_mean_err_categorical": float(np.max(np.abs(s1.mean(axis=0) - mean_theory))),
    "max_abs_mean_err_sequential": float(np.max(np.abs(s2.mean(axis=0) - mean_theory))),
    "max_abs_cov_err_categorical": float(np.max(np.abs(np.cov(s1.T, ddof=0) - cov_theory))),
    "max_abs_cov_err_sequential": float(np.max(np.abs(np.cov(s2.T, ddof=0) - cov_theory))),
}
summary
{'max_abs_mean_err_categorical': 0.019259999999999167,
 'max_abs_mean_err_sequential': 0.012549999999999173,
 'max_abs_cov_err_categorical': 0.06312542559999734,
 'max_abs_cov_err_sequential': 0.05956750249998599}

8) Visualization#

We’ll visualize:

  • the PMF for K=3 on the simplex (triangle)

  • a useful 2D CDF slice: \(\Pr(X_1\le a, X_2\le b)\) (leaving the third count unconstrained)

  • Monte Carlo samples on the simplex

# PMF on the simplex (K=3)
n = 20
p = np.array([0.25, 0.50, 0.25])

support = enumerate_support(n=n, k=3)
pmf = multinomial_pmf(support, n=n, p=p)

sx, sy = simplex_xy_3(support)

tri_x = [0.0, 1.0, 0.5, 0.0]
tri_y = [0.0, 0.0, float(np.sqrt(3) / 2.0), 0.0]

fig = go.Figure()
fig.add_trace(go.Scatter(x=tri_x, y=tri_y, mode="lines", line=dict(color="black"), showlegend=False))
fig.add_trace(
    go.Scatter(
        x=sx,
        y=sy,
        mode="markers",
        marker=dict(size=10, color=pmf, colorscale="Viridis", colorbar=dict(title="PMF")),
        text=[f"x={tuple(row)}, pmf={val:.3e}" for row, val in zip(support, pmf)],
        hoverinfo="text",
        showlegend=False,
    )
)
fig.update_layout(
    title=f"Multinomial PMF on the simplex (n={n}, p={p.tolist()})",
    xaxis_title="barycentric x",
    yaxis_title="barycentric y",
    xaxis=dict(range=[-0.05, 1.05], zeroline=False),
    yaxis=dict(scaleanchor="x", scaleratio=1, range=[-0.05, float(np.sqrt(3) / 2.0) + 0.05], zeroline=False),
)
fig.show()
# 2D CDF slice: P(X1 <= a, X2 <= b) for K=3
# This equals the lower-orthant CDF evaluated at (a, b, n).

n = 20
p = np.array([0.25, 0.50, 0.25])

support = enumerate_support(n=n, k=3)
pmf = multinomial_pmf(support, n=n, p=p)

cdf = np.zeros((n + 1, n + 1), dtype=float)
for a in range(n + 1):
    for b in range(n + 1):
        mask = (support[:, 0] <= a) & (support[:, 1] <= b)
        cdf[a, b] = pmf[mask].sum()

fig = go.Figure(
    data=go.Heatmap(
        z=cdf,
        x=np.arange(n + 1),
        y=np.arange(n + 1),
        colorscale="Blues",
        colorbar=dict(title="P(X1≤a, X2≤b)"),
    )
)
fig.update_layout(
    title=f"2D CDF slice: P(X1≤a, X2≤b) (n={n}, p={p.tolist()})",
    xaxis_title="b (upper bound for X2)",
    yaxis_title="a (upper bound for X1)",
)
fig.show()
# Monte Carlo samples on the simplex (K=3)

n = 20
p = np.array([0.25, 0.50, 0.25])

samples = multinomial_rvs_sequential_binomial(n, p, size=25_000, rng=rng)
x, y = simplex_xy_3(samples)

tri_x = [0.0, 1.0, 0.5, 0.0]
tri_y = [0.0, 0.0, float(np.sqrt(3) / 2.0), 0.0]

fig = go.Figure()
fig.add_trace(go.Scatter(x=tri_x, y=tri_y, mode="lines", line=dict(color="black"), showlegend=False))
fig.add_trace(
    go.Scatter(
        x=x,
        y=y,
        mode="markers",
        marker=dict(size=4, opacity=0.15, color="rgba(31,119,180,1)"),
        showlegend=False,
    )
)

# Theoretical mean location
mean = n * p
mx, my = simplex_xy_3(mean)
fig.add_trace(
    go.Scatter(
        x=mx,
        y=my,
        mode="markers",
        marker=dict(size=12, color="red"),
        name="mean (theory)",
    )
)

fig.update_layout(
    title=f"Monte Carlo samples on simplex (n={n}, p={p.tolist()})",
    xaxis_title="barycentric x",
    yaxis_title="barycentric y",
    xaxis=dict(range=[-0.05, 1.05], zeroline=False),
    yaxis=dict(scaleanchor="x", scaleratio=1, range=[-0.05, float(np.sqrt(3) / 2.0) + 0.05], zeroline=False),
)
fig.show()

9) SciPy Integration#

SciPy provides a numerically robust implementation via scipy.stats.multinomial.

  • Supports pmf / logpmf, rvs, mean, cov, and entropy.

  • As of SciPy 1.15, multinomial does not expose:

    • a cdf method (multivariate CDFs are expensive)

    • a .fit() method

For CDFs and fitting, you typically implement problem-specific utilities:

  • CDFs by enumeration for small n,K or by approximation.

  • p estimation via the closed-form MLE \(\hat p_i = \tfrac{\sum_j x_i^{(j)}}{\sum_j n_j}\).

from scipy.stats import multinomial

n = 12
p = np.array([0.2, 0.3, 0.5])

support = enumerate_support(n=n, k=p.size)

pmf_scipy = multinomial.pmf(support, n=n, p=p)
pmf_ours = multinomial_pmf(support, n=n, p=p)

samples_scipy = multinomial.rvs(n=n, p=p, size=50_000, random_state=rng)

# SciPy doesn't have multinomial.cdf, but we can compute it for small n with enumeration.
example_cdf = multinomial_cdf_small_n([4, 4, 12], n=n, p=p)

{
    "pmf_sum_scipy": float(pmf_scipy.sum()),
    "pmf_sum_ours": float(pmf_ours.sum()),
    "max_abs_pmf_diff": float(np.max(np.abs(pmf_scipy - pmf_ours))),
    "mean_empirical": samples_scipy.mean(axis=0),
    "mean_theory": multinomial.mean(n=n, p=p),
    "entropy_scipy": float(multinomial.entropy(n=n, p=p)),
    "example_cdf_small_n": float(example_cdf),
}
{'pmf_sum_scipy': 0.9999999999999992,
 'pmf_sum_ours': 0.9999999999999982,
 'max_abs_pmf_diff': 2.220446049250313e-16,
 'mean_empirical': array([2.3973 , 3.60692, 5.99578]),
 'mean_theory': array([2.4, 3.6, 6. ]),
 'entropy_scipy': 3.5338043228652705,
 'example_cdf_small_n': 0.6555341562499992}
# "Fit" p: closed-form MLE on SciPy-generated data
n = 30
p_true = np.array([0.10, 0.25, 0.05, 0.20, 0.40])

data = multinomial.rvs(n=n, p=p_true, size=5_000, random_state=rng)
p_hat = fit_multinomial_mle(data)

{
    "p_true": p_true,
    "p_hat": p_hat,
    "L1_error": float(np.sum(np.abs(p_hat - p_true))),
}
{'p_true': array([0.1 , 0.25, 0.05, 0.2 , 0.4 ]),
 'p_hat': array([0.100627, 0.250553, 0.04976 , 0.200673, 0.398387]),
 'L1_error': 0.0037066666666667053}

10) Statistical Use Cases#

A) Hypothesis testing (goodness-of-fit)#

Given observed counts \(x\) and a null probability vector \(p^{(0)}\), a common test is a chi-square goodness-of-fit test.

  • Null: data comes from \(\mathrm{Multinomial}(n, p^{(0)})\)

  • Expected counts under the null: \(n p^{(0)}\)

Rule of thumb: the chi-square approximation is reliable when every expected count \(n p_i^{(0)}\) is at least about 5.

B) Bayesian modeling (Dirichlet conjugacy)#

If

\[ p \sim \mathrm{Dirichlet}(\alpha),\qquad X\mid p \sim \mathrm{Multinomial}(n,p), \]

then the posterior is

\[ p\mid X=x \sim \mathrm{Dirichlet}(\alpha + x). \]

C) Generative modeling#

Multinomial likelihoods appear whenever you model count vectors given probabilities, e.g.

  • bag-of-words document models

  • discrete emissions in mixture models

  • naive Bayes classifiers

from scipy.stats import chisquare, power_divergence

# A) Chi-square goodness-of-fit for one multinomial observation
n = 200
p0 = np.array([0.2, 0.5, 0.3])

# simulate an observation under an alternative; rvs returns shape (1, K),
# so take the single draw as a 1-D vector for chisquare
x_obs = multinomial.rvs(n=n, p=[0.25, 0.45, 0.30], size=1, random_state=rng)[0]

expected = n * p0
chi2 = chisquare(f_obs=x_obs, f_exp=expected)

# Likelihood ratio (G-test): lambda_=0 gives the log-likelihood ratio statistic
# (Different approximations behave differently when expected counts are small.)
gtest = power_divergence(f_obs=x_obs, f_exp=expected, lambda_=0)

{
    "x_obs": x_obs,
    "p0": p0,
    "chi2_stat": float(chi2.statistic),
    "chi2_pvalue": float(chi2.pvalue),
    "g_stat": float(gtest.statistic),
    "g_pvalue": float(gtest.pvalue),
}
# B) Dirichlet conjugacy: posterior update and posterior mean

def dirichlet_rvs_numpy(alpha, size, *, rng: np.random.Generator):
    alpha = np.asarray(alpha, dtype=float)
    if alpha.ndim != 1 or alpha.size < 2:
        raise ValueError("alpha must be a 1D array with length >= 2")
    if np.any(alpha <= 0) or not np.all(np.isfinite(alpha)):
        raise ValueError("alpha must be positive and finite")

    g = rng.gamma(shape=alpha, scale=1.0, size=(size, alpha.size))
    return g / g.sum(axis=1, keepdims=True)


alpha_prior = np.array([1.0, 1.0, 1.0])  # Dirichlet(1,1,1): uniform over the 2-simplex (K=3)
n = 50
p_true = np.array([0.2, 0.5, 0.3])

x = multinomial.rvs(n=n, p=p_true, size=1, random_state=rng)[0]  # 1-D count vector
alpha_post = alpha_prior + x

posterior_mean = alpha_post / alpha_post.sum()

{
    "x": x,
    "alpha_prior": alpha_prior,
    "alpha_post": alpha_post,
    "posterior_mean": posterior_mean,
}
{'x': array([12, 25, 13]),
 'alpha_prior': array([1., 1., 1.]),
 'alpha_post': array([13., 26., 14.]),
 'posterior_mean': array([0.245283, 0.490566, 0.264151])}
# C) Generative modeling example: bag-of-words counts

vocab = ["cat", "dog", "fish", "tree", "car"]
p_words = np.array([0.30, 0.25, 0.05, 0.20, 0.20])

n_words = 80
counts = multinomial.rvs(n=n_words, p=p_words, size=1, random_state=rng)[0]  # 1-D count vector

{w: int(c) for w, c in zip(vocab, counts)}

11) Pitfalls#

Invalid parameters#

  • n must be a non-negative integer.

  • p must be non-negative and sum to 1.

  • For the PMF, x must be a non-negative integer vector with sum exactly n.

Numerical issues#

  • Factorials explode quickly; compute PMFs in log-space when n is moderate or large (see the sketch after this list).

  • Probabilities can underflow for rare events; prefer logpmf when comparing likelihoods.
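
A minimal illustration of the log-space advice, reusing multinomial_logpmf from section 3 (the constants are illustrative):

# For n this large the multinomial coefficient alone exceeds float64 range,
# so computing the PMF term-by-term in ordinary floats fails.
n_big = 1500
p_big = np.array([0.2, 0.3, 0.5])
x_big = np.array([300, 450, 750])  # sums to n_big

log_val = multinomial_logpmf(x_big, n=n_big, p=p_big)
print(float(log_val))          # a modest, well-scaled negative number
print(float(np.exp(log_val)))  # small but representable for this x

# By contrast, float(math.factorial(n_big)) raises OverflowError.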

Modeling issues#

  • The multinomial assumes independent trials with a fixed probability vector p. Overdispersion (extra variability across replicates) is common in practice; a Dirichlet–multinomial can help (see the sketch after this list).

  • Categories must be well-defined and mutually exclusive; if observations can belong to multiple labels, the model changes.
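
A quick sketch of that overdispersion (illustrative names; the throwaway generator keeps the shared rng stream intact): draw p from a Dirichlet before each multinomial draw and compare component variances against a plain multinomial with the same mean probabilities.

# Dirichlet-multinomial counts have inflated variance relative to a
# multinomial with the same mean; theory: factor (n + a0) / (1 + a0),
# where a0 = alpha.sum().
rng_dm = np.random.default_rng(4)
n_dm = 50
alpha_dm = np.array([2.0, 3.0, 5.0])  # Dirichlet concentration, a0 = 10
p_bar = alpha_dm / alpha_dm.sum()     # mean of the Dirichlet

m = 20_000
p_draws = rng_dm.dirichlet(alpha_dm, size=m)
dm_counts = np.array([rng_dm.multinomial(n_dm, q) for q in p_draws])
mn_counts = rng_dm.multinomial(n_dm, p_bar, size=m)

# Same means, but the first variance should be several times the second.
print(float(dm_counts[:, 0].var()), float(mn_counts[:, 0].var()))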

CDF gotchas#

  • A multivariate CDF is less canonical than a 1-D CDF: several orderings/definitions exist (this notebook uses the lower-orthant version).

  • Even for the standard lower-orthant CDF, exact computation is usually expensive beyond small n and K; Monte Carlo estimation (sketched below) scales much better.
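
When enumeration is out of reach, the lower-orthant CDF can still be estimated by Monte Carlo; a sketch (illustrative constants):

# Monte Carlo estimate of P(X <= x_bound componentwise) for a larger n.
rng_cdf = np.random.default_rng(5)
n_mc = 500
p_mc = np.array([0.2, 0.3, 0.5])
x_bound = np.array([110, 160, 260])  # orthant upper bounds

draws = rng_cdf.multinomial(n_mc, p_mc, size=500_000)
cdf_mc = float(np.all(draws <= x_bound, axis=1).mean())

# Monte Carlo standard error is sqrt(F(1-F)/m), under ~1e-3 at this size.
print(cdf_mc)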

12) Summary#

  • multinomial is a discrete multivariate distribution over count vectors \(x\in\mathbb{N}_0^K\) with \(\sum_i x_i=n\).

  • PMF: \(\frac{n!}{\prod_i x_i!}\prod_i p_i^{x_i}\); mean \(np\); covariance \(n(\mathrm{diag}(p)-pp^\top)\).

  • Each component is marginally binomial, but components are negatively correlated.

  • The MLE for p is the empirical frequency: \(\hat p_i = x_i/n\) (or pooled across observations).

  • For computation and simulation, prefer scipy.stats.multinomial and log-space methods; for CDFs you usually need custom code or approximations.

References#

  • SciPy: scipy.stats.multinomial

  • Standard probability texts covering multinomial coefficients and multinomial sampling